I am from China and came to Canada to study several years ago. What impressed me most is the country's beautiful natural environment. With its scenic views, mild climate, and friendly people, Vancouver is well known around the world as both a popular tourist attraction and one of the best places to live (https://vancouver.ca/news-calendar/our-city.aspx). Its diverse natural environment, including mountains, ocean, and varied wildlife, has attracted many immigrants. Vancouver's trees are one of its most important features and contribute a great deal to the city's beauty. I am curious about how trees are distributed across Vancouver's neighbourhoods, which genera they belong to, when they were planted, and so on. To answer these questions, an exploratory data analysis and visualization of the Vancouver tree dataset is needed.
In this report, I explore the distribution of Vancouver's trees by analyzing a subset of the Vancouver Street Trees dataset (https://opendata.vancouver.ca/explore/dataset/street-trees/information/?disjunctive.species_name&disjunctive.common_name&disjunctive.height_range_id&disjunctive.on_street&disjunctive.neighbourhood_name). The data were obtained from the City of Vancouver's Open Data Portal and are released under the Open Government Licence – Vancouver (https://opendata.vancouver.ca/pages/licence/). Analyzing and visualizing this dataset will reveal more detail about Vancouver's trees (the subset is from https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_unique_vancouver.csv).
By exploring this dataset, I am interested in answering the following questions:
For the final dashboard, I would like to present tree details for a selected Vancouver neighbourhood, such as the top 10 genera, the distributions of tree diameter and height, the planting year, and the street side name.
# Import libraries needed for EDA
import altair as alt
import pandas as pd
import numpy as np
# alt.data_transformers.enable('data_server')
# Load the dataset and parse the 'date_planted' as date datatype
van_tree_df = pd.read_csv('small_unique_vancouver.csv', parse_dates=['date_planted'])
# Take a look at all the columns of the dataset
van_tree_df.head()
| | Unnamed: 0 | std_street | on_street | species_name | neighbourhood_name | date_planted | diameter | street_side_name | genus_name | assigned | ... | plant_area | curb | tree_id | common_name | height_range_id | on_street_block | cultivar_name | root_barrier | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10747 | W 20TH AV | W 20TH AV | PLATANOIDES | Riley Park | 2000-02-23 | 28.5 | EVEN | ACER | N | ... | 15 | Y | 21421 | NORWAY MAPLE | 4 | 0 | NaN | N | 49.252711 | -123.106323 |
| 1 | 12573 | W 18TH AV | W 18TH AV | CALLERYANA | Arbutus-Ridge | 1992-02-04 | 6.0 | ODD | PYRUS | N | ... | 7 | Y | 129645 | CHANTICLEER PEAR | 2 | 2300 | CHANTICLEER | N | 49.256350 | -123.158709 |
| 2 | 29676 | ROSS ST | ROSS ST | NIGRA | Sunset | NaT | 12.0 | ODD | PINUS | N | ... | 7 | Y | 154675 | AUSTRIAN PINE | 4 | 7800 | NaN | N | 49.213486 | -123.083254 |
| 3 | 8856 | DOMAN ST | DOMAN ST | AMERICANA | Killarney | 1999-11-12 | 11.0 | EVEN | FRAXINUS | N | ... | 7 | Y | 180803 | AUTUMN APPLAUSE ASH | 4 | 6900 | AUTUMN APPLAUSE | N | 49.220839 | -123.036721 |
| 4 | 21098 | EAST BOULEVARD | EAST BOULEVARD | HIPPOCASTANUM | Shaughnessy | NaT | 15.5 | ODD | AESCULUS | Y | ... | N | Y | 74364 | COMMON HORSECHESTNUT | 4 | 5200 | NaN | N | 49.238514 | -123.154958 |
5 rows × 21 columns
First, let's take a look at the general information of this dataset and all the columns.
van_tree_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | Unnamed: 0 | 5000 non-null | int64 |
| 1 | std_street | 5000 non-null | object |
| 2 | on_street | 5000 non-null | object |
| 3 | species_name | 5000 non-null | object |
| 4 | neighbourhood_name | 5000 non-null | object |
| 5 | date_planted | 2363 non-null | datetime64[ns] |
| 6 | diameter | 5000 non-null | float64 |
| 7 | street_side_name | 5000 non-null | object |
| 8 | genus_name | 5000 non-null | object |
| 9 | assigned | 5000 non-null | object |
| 10 | civic_number | 5000 non-null | int64 |
| 11 | plant_area | 4950 non-null | object |
| 12 | curb | 5000 non-null | object |
| 13 | tree_id | 5000 non-null | int64 |
| 14 | common_name | 5000 non-null | object |
| 15 | height_range_id | 5000 non-null | int64 |
| 16 | on_street_block | 5000 non-null | int64 |
| 17 | cultivar_name | 2658 non-null | object |
| 18 | root_barrier | 5000 non-null | object |
| 19 | latitude | 5000 non-null | float64 |
| 20 | longitude | 5000 non-null | float64 |
dtypes: datetime64[ns](1), float64(3), int64(5), object(12)
memory usage: 820.4+ KB
According to the output above, the dataset has 5000 entries and 21 columns. The first column, which pandas labels Unnamed: 0 because it has no name in the file, is the row index of the original dataset (we are analyzing only a subset). This column is therefore dropped before further analysis.
# Drop the index column of original dataset
van_tree_df=van_tree_df.iloc[:,1:]
van_tree_df.head()
| | std_street | on_street | species_name | neighbourhood_name | date_planted | diameter | street_side_name | genus_name | assigned | civic_number | plant_area | curb | tree_id | common_name | height_range_id | on_street_block | cultivar_name | root_barrier | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | W 20TH AV | W 20TH AV | PLATANOIDES | Riley Park | 2000-02-23 | 28.5 | EVEN | ACER | N | 66 | 15 | Y | 21421 | NORWAY MAPLE | 4 | 0 | NaN | N | 49.252711 | -123.106323 |
| 1 | W 18TH AV | W 18TH AV | CALLERYANA | Arbutus-Ridge | 1992-02-04 | 6.0 | ODD | PYRUS | N | 2323 | 7 | Y | 129645 | CHANTICLEER PEAR | 2 | 2300 | CHANTICLEER | N | 49.256350 | -123.158709 |
| 2 | ROSS ST | ROSS ST | NIGRA | Sunset | NaT | 12.0 | ODD | PINUS | N | 7855 | 7 | Y | 154675 | AUSTRIAN PINE | 4 | 7800 | NaN | N | 49.213486 | -123.083254 |
| 3 | DOMAN ST | DOMAN ST | AMERICANA | Killarney | 1999-11-12 | 11.0 | EVEN | FRAXINUS | N | 6938 | 7 | Y | 180803 | AUTUMN APPLAUSE ASH | 4 | 6900 | AUTUMN APPLAUSE | N | 49.220839 | -123.036721 |
| 4 | EAST BOULEVARD | EAST BOULEVARD | HIPPOCASTANUM | Shaughnessy | NaT | 15.5 | ODD | AESCULUS | Y | 5295 | N | Y | 74364 | COMMON HORSECHESTNUT | 4 | 5200 | NaN | N | 49.238514 | -123.154958 |
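As an alternative to slicing the column away after loading, the saved index could be absorbed at load time. The following is a minimal sketch, assuming the file's first column really is the original row index; the two-row CSV text is a toy stand-in built from values shown in the output above, not the real file.

```python
import io
import pandas as pd

# Toy stand-in for small_unique_vancouver.csv: the first header cell is empty,
# which is why a plain read_csv labels that column "Unnamed: 0".
csv_text = (
    ",tree_id,genus_name,diameter\n"
    "10747,21421,ACER,28.5\n"
    "12573,129645,PYRUS,6.0\n"
)

# Plain load: the saved index shows up as an ordinary column.
plain = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())
print(indexed.columns.tolist())
```

Unlike the iloc approach, this keeps the original row numbers as index labels rather than discarding them, which may or may not be wanted.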
Next, let us look at the description of each column on the dataset's information page (https://opendata.vancouver.ca/explore/dataset/street-trees/information/?disjunctive.species_name&disjunctive.common_name&disjunctive.height_range_id&disjunctive.on_street&disjunctive.neighbourhood_name). The column details are summarized in Table 1.
| Column name | Datatype | Details |
|---|---|---|
| std_street | object | Street name of the site at which the tree is associated with |
| on_street | object | The name of the street at which the tree is physically located on |
| species_name | object | Species name |
| neighbourhood_name | object | City's defined local area in which the tree is located. |
| date_planted | datetime | The date of planting in YYYYMMDD format. |
| diameter | float | DBH in inches (DBH stands for diameter of tree at breast height) |
| street_side_name | object | The street side which the tree is physically located on (Even, Odd or Median (Med)) |
| genus_name | object | Genus name |
| assigned | object | Indicates whether the address is made up to associate the tree with a nearby lot (Y=Yes or N=No) |
| civic_number | int | Street address of the site at which the tree is associated with |
| plant_area | object | B = behind sidewalk, G = in tree grate, N = no sidewalk, C = cutout, a number indicates boulevard width in feet |
| curb | object | Curb presence (Y = Yes, N = No) |
| tree_id | int | Numerical ID |
| common_name | object | Common name |
| height_range_id | int | 0-10 for every 10 feet (e.g., 0 = 0-10 ft, 1 = 10-20 ft, 2 = 20-30 ft, and 10 = 100+ ft) |
| on_street_block | int | The street block at which the tree is physically located on |
| cultivar_name | object | Cultivar name |
| root_barrier | object | Root barrier installed (Y = Yes, N = No) |
| latitude | float | Location latitude |
| longitude | float | Location longitude |
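Two of the encodings in Table 1 are easy to misread in charts: height_range_id is a bucket id rather than a height, and plant_area mixes letter codes with numeric boulevard widths. The sketch below shows one possible way to decode them; the helper names height_range_label and boulevard_width_ft are my own, not part of the dataset or of pandas.

```python
import pandas as pd

def height_range_label(range_id: int) -> str:
    """Translate a height_range_id (0-10) into the feet range listed in Table 1."""
    if not 0 <= range_id <= 10:
        raise ValueError(f"height_range_id must be between 0 and 10, got {range_id}")
    if range_id == 10:
        return "100+ ft"
    low = range_id * 10
    return f"{low}-{low + 10} ft"

def boulevard_width_ft(plant_area: pd.Series) -> pd.Series:
    """Numeric boulevard width in feet; the letter codes (B, G, N, C) become NaN."""
    return pd.to_numeric(plant_area, errors="coerce")

# plant_area values as they appear in the head() outputs of this report.
plant_area = pd.Series(["15", "7", "N", "C"])
widths = boulevard_width_ft(plant_area)

print(height_range_label(4))
print(widths.tolist())
```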
# Let us print out the summarized information for numeric columns
van_tree_df.describe()
| | diameter | civic_number | tree_id | height_range_id | on_street_block | latitude | longitude |
|---|---|---|---|---|---|---|---|
| count | 5000.000000 | 5000.000000 | 5000.000000 | 5000.00000 | 5000.000000 | 5000.000000 | 5000.000000 |
| mean | 12.340888 | 2975.707600 | 128682.584600 | 2.73440 | 2960.227000 | 49.247349 | -123.107128 |
| std | 9.266600 | 2078.580429 | 75412.260406 | 1.56957 | 2086.861052 | 0.021251 | 0.049137 |
| min | 0.000000 | 2.000000 | 36.000000 | 0.00000 | 0.000000 | 49.202783 | -123.220560 |
| 25% | 4.000000 | 1300.500000 | 61321.500000 | 2.00000 | 1300.000000 | 49.230152 | -123.144178 |
| 50% | 10.000000 | 2639.000000 | 130130.500000 | 2.00000 | 2600.000000 | 49.247981 | -123.105861 |
| 75% | 18.000000 | 4123.000000 | 191332.000000 | 4.00000 | 4100.000000 | 49.263275 | -123.063484 |
| max | 71.000000 | 9113.000000 | 270750.000000 | 9.00000 | 9100.000000 | 49.293930 | -123.023311 |
This dataset has 7 numerical columns: diameter, civic_number, tree_id, height_range_id, on_street_block, latitude and longitude. The remaining columns are categorical, except date_planted, which is temporal.
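The split into numerical, temporal and categorical columns can also be read off the dtypes programmatically. Here is a minimal sketch using two rows copied from the head() output above, restricted to four columns; the variable names are illustrative only.

```python
import pandas as pd

# Two rows copied from the head() output above, restricted to four columns.
toy = pd.DataFrame({
    "tree_id": [21421, 129645],
    "diameter": [28.5, 6.0],
    "genus_name": ["ACER", "PYRUS"],
    "date_planted": pd.to_datetime(["2000-02-23", "1992-02-04"]),
})

# Group the column names by the kind of dtype pandas assigned to them.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
temporal_cols = toy.select_dtypes(include="datetime").columns.tolist()
categorical_cols = [c for c in toy.columns if c not in numeric_cols + temporal_cols]

print(numeric_cols, temporal_cols, categorical_cols)
```

A numeric dtype does not always mean a measurement: tree_id is stored as an integer but is only an identifier, so its mean and quartiles in describe() carry no meaning.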
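# Let us print out the summarized information for categorical and temporal columns
# Note: the datetime_is_numeric argument is only accepted by pandas 1.1 to 1.5;
# it was removed in pandas 2.0, where datetime columns are always summarized numerically
van_tree_df.describe(exclude=[np.number],datetime_is_numeric=True)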
| | std_street | on_street | species_name | neighbourhood_name | date_planted | street_side_name | genus_name | assigned | plant_area | curb | common_name | cultivar_name | root_barrier |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5000 | 5000 | 5000 | 5000 | 2363 | 5000 | 5000 | 5000 | 4950 | 5000 | 5000 | 2658 | 5000 |
| unique | 603 | 607 | 171 | 22 | NaN | 4 | 67 | 2 | 38 | 2 | 361 | 176 | 2 |
| top | W 13TH AV | CAMBIE ST | SERRULATA | Renfrew-Collingwood | NaN | ODD | ACER | N | 10 | Y | KWANZAN FLOWERING CHERRY | KWANZAN | N |
| freq | 52 | 49 | 463 | 384 | NaN | 2554 | 1218 | 4564 | 736 | 4593 | 383 | 383 | 4679 |
| mean | NaN | NaN | NaN | NaN | 2003-09-06 04:03:08.912399488 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| min | NaN | NaN | NaN | NaN | 1989-10-31 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 25% | NaN | NaN | NaN | NaN | 1997-11-06 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 50% | NaN | NaN | NaN | NaN | 2003-02-12 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 75% | NaN | NaN | NaN | NaN | 2009-11-17 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| max | NaN | NaN | NaN | NaN | 2019-05-07 00:00:00 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
In this dataset, most columns have 5000 non-null entries, while date_planted, plant_area and cultivar_name have fewer: 2363, 4950 and 2658, respectively. Both date_planted and cultivar_name are missing for roughly half of the rows. I will keep date_planted because it is one of the variables of interest, but I will drop cultivar_name. I will also drop the rows where plant_area is NaN. Finally, I will drop some columns of no interest: std_street, on_street, assigned and civic_number.
# Drop the columns of no interest and 'cultivar_name'
tree_df=van_tree_df.drop(columns=['std_street','on_street','assigned','civic_number','cultivar_name'])
# Drop the NaN rows in 'plant_area'
tree_df=tree_df.dropna(subset=['plant_area'])
tree_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4950 entries, 0 to 4999
Data columns (total 15 columns):
| # | Column | Non-Null Count | Dtype |
|---|---|---|---|
| 0 | species_name | 4950 non-null | object |
| 1 | neighbourhood_name | 4950 non-null | object |
| 2 | date_planted | 2328 non-null | datetime64[ns] |
| 3 | diameter | 4950 non-null | float64 |
| 4 | street_side_name | 4950 non-null | object |
| 5 | genus_name | 4950 non-null | object |
| 6 | plant_area | 4950 non-null | object |
| 7 | curb | 4950 non-null | object |
| 8 | tree_id | 4950 non-null | int64 |
| 9 | common_name | 4950 non-null | object |
| 10 | height_range_id | 4950 non-null | int64 |
| 11 | on_street_block | 4950 non-null | int64 |
| 12 | root_barrier | 4950 non-null | object |
| 13 | latitude | 4950 non-null | float64 |
| 14 | longitude | 4950 non-null | float64 |
dtypes: datetime64[ns](1), float64(3), int64(3), object(8)
memory usage: 618.8+ KB
Now the dataset is ready for visualization.
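As a quick check on the cleaning decisions, the non-null counts reported by the first info() call can be converted into missing shares. This sketch uses only the three counts and the 5000-row total from that output; on the loaded frame, van_tree_df.isna().mean() should give the same shares directly.

```python
import pandas as pd

# Non-null counts copied from the info() output of the 5000-row subset.
total_rows = 5000
non_null = pd.Series({"date_planted": 2363, "plant_area": 4950, "cultivar_name": 2658})

# Share of rows with a missing value in each column.
missing_share = (1 - non_null / total_rows).round(4)
print(missing_share)
```

The shares make the trade-off explicit: plant_area loses about 1% of rows when its NaN values are dropped, while dropping on date_planted or cultivar_name would remove around half of the data.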
# Drop the NaN values of 'date_planted' for analysis
tree_date_df=tree_df.dropna(subset=['date_planted'])
# Create 'Year' and 'Month' columns from date_planted
tree_date_df=tree_date_df.assign(Year=tree_date_df['date_planted'].dt.year.astype(int))
tree_date_df=tree_date_df.assign(Month=tree_date_df['date_planted'].dt.month_name())
tree_date_df.head()
| | species_name | neighbourhood_name | date_planted | diameter | street_side_name | genus_name | plant_area | curb | tree_id | common_name | height_range_id | on_street_block | root_barrier | latitude | longitude | Year | Month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | PLATANOIDES | Riley Park | 2000-02-23 | 28.5 | EVEN | ACER | 15 | Y | 21421 | NORWAY MAPLE | 4 | 0 | N | 49.252711 | -123.106323 | 2000 | February |
| 1 | CALLERYANA | Arbutus-Ridge | 1992-02-04 | 6.0 | ODD | PYRUS | 7 | Y | 129645 | CHANTICLEER PEAR | 2 | 2300 | N | 49.256350 | -123.158709 | 1992 | February |
| 3 | AMERICANA | Killarney | 1999-11-12 | 11.0 | EVEN | FRAXINUS | 7 | Y | 180803 | AUTUMN APPLAUSE ASH | 4 | 6900 | N | 49.220839 | -123.036721 | 1999 | November |
| 5 | PERSICA | West End | 2012-04-05 | 3.0 | EVEN | PARROTIA | C | Y | 233622 | VANESSA PERSIAN IRONWOOD | 1 | 1100 | N | 49.281906 | -123.133076 | 2012 | April |
| 7 | OFFICINALIS | Kensington-Cedar Cottage | 2001-04-02 | 3.0 | EVEN | MAGNOLIA | N | Y | 187792 | CHINESE MAGNOLIA | 2 | 3700 | N | 49.251127 | -123.071912 | 2001 | April |
# Plot bar charts of the number of trees planted in each year and in each month
year_chart=(alt.Chart(tree_date_df)
.mark_bar()
.encode(
alt.X('Year:N', title='Tree planted year'),
alt.Y('count()',title='Number of trees',axis=alt.Axis(grid=False)),
alt.Color(value='green'),
tooltip='count()')
.properties(title='Number of trees planted in different year'))
month_order=['January','February','March','April','May','June','July','August','September','October','November','December']
month_chart=(alt.Chart(tree_date_df)
.mark_bar().encode(
alt.X('Month:O', title='Tree planted month',sort=month_order),
alt.Y('count()',title='Number of trees', axis=alt.Axis(grid=False)),
alt.Color(value='pink'),
tooltip=('count()'))
.properties(title='Number of trees planted in different month'))
date_chart=year_chart|month_chart
date_chart
From Figure 1, the number of trees planted increases steadily from 1989 to 1996. From 1996 to 2013 it fluctuates slightly, and then it drops sharply from 2013 to 2016 before rising slightly again. By month, February has the most planted trees, while July and August have the fewest.
Usually, a tree with a larger diameter is also taller. Is this true for the Vancouver tree dataset? Let's look at the relationship between diameter and height_range_id.
# Create a scatterplot of 'diameter' against 'height_range_id', with count() on the color channel
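The counts behind the two bar charts can also be tabulated directly, which makes it easier to quote exact numbers. Below is a minimal sketch on the planting dates visible in the head() outputs above (two of them missing); on the real data, tree_df would take the place of the toy frame.

```python
import pandas as pd

# Planting dates visible in the head() outputs above; None marks a missing date.
toy = pd.DataFrame({
    "date_planted": pd.to_datetime(
        ["2000-02-23", "1992-02-04", None, "1999-11-12", None, "2012-04-05", "2001-04-02"]
    )
})
planted = toy.dropna(subset=["date_planted"])

# Month names in calendar order, so months without any planting still appear as 0.
month_order = pd.date_range("2000-01-01", periods=12, freq="MS").month_name().tolist()
month_counts = (
    planted["date_planted"].dt.month_name()
    .value_counts()
    .reindex(month_order, fill_value=0)
)

# Trees planted per year, sorted chronologically.
year_counts = planted["date_planted"].dt.year.value_counts().sort_index()

print(month_counts)
print(year_counts)
```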
diameter_height_scatterplot=(alt.Chart(tree_df)
.mark_circle()
.encode(
alt.X('height_range_id:Q', title='Tree height range id'),
alt.Y('diameter:Q', title='Tree diameter (inch)'),
alt.Color('count()', title='Number of trees'))
.properties(title='Relationship of tree diameter VS tree height range id'))
diameter_height_scatterplot
From the scatterplot, tree height and tree diameter have a positive relationship: trees with a larger diameter usually have a higher height range id. However, this is only an overall trend and does not hold for every single point. Therefore, I would like to draw a boxplot to reveal more statistics and also add a line of the mean diameter to the scatterplot.
# Create a boxplot of 'diameter' for each 'height_range_id'
diameter_height_boxplot=(alt.Chart(tree_df)
.mark_boxplot()
.encode(
alt.X('height_range_id:Q', title='Tree height range id'),
alt.Y('diameter:Q', title='Tree diameter (inch)'))
.properties(title='Relationship of tree diameter VS tree height range id'))
# Create a line chart of the mean 'diameter' and add it to the scatterplot
diameter_height_lineplot=(alt.Chart(tree_df)
.mark_line(color='red')
.encode(
alt.X('height_range_id:Q', title='Tree height range id'),
alt.Y('mean(diameter):Q', title='')))
# Combine the scatterplot and line chart together, lay out with boxplot vertically
(diameter_height_lineplot + diameter_height_scatterplot)|diameter_height_boxplot
From Figure 3, the overall relationship between tree diameter and height range id (reflected by the mean and median values) is positive, except for a slight decrease in diameter from height range id 8 to 9 (this could be due to the small number of data points).
Knowing only the relationship between tree diameter and height is not enough for this report. I am more interested in the tree details of different neighbourhoods. Therefore, I present the tree diameter and height range id as rug plots and add dropdown widgets for selecting a neighbourhood and a genus, so that their tree diameters and heights can be inspected. I am also curious, for a selected neighbourhood and genus, how many trees are on each street side (street_side_name) and when they were planted (planting year). I created the following plot (Figure 4) to answer these questions and will add it to the final dashboard.
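One way to check the suggestion that the dip at the highest ranges comes from sparse data is to tabulate, for each height_range_id, how many trees there are alongside the mean and median diameter. The following is a minimal sketch on the seven rows visible in the head() outputs above; on the real data, tree_df would take the place of the toy frame.

```python
import pandas as pd

# (height_range_id, diameter) pairs copied from the head() outputs above.
toy = pd.DataFrame({
    "height_range_id": [4, 2, 4, 4, 4, 1, 2],
    "diameter": [28.5, 6.0, 12.0, 11.0, 15.5, 3.0, 3.0],
})

# Count, mean and median diameter per height range, the same statistics
# that the line chart and the boxplot summarize visually.
summary = toy.groupby("height_range_id")["diameter"].agg(
    n_trees="count",
    mean_diameter="mean",
    median_diameter="median",
)
print(summary)
```

A range with only a handful of trees in the n_trees column would support reading its mean as noisy rather than as a real reversal of the trend.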
neighbourhood = tree_df['neighbourhood_name'].unique()
dropdown_neighbourhood = alt.binding_select(name='Neighbourhood Name', options=neighbourhood)
genus = tree_df['genus_name'].unique()
dropdown_genus=alt.binding_select(name='Genus', options=genus)
select_neighbourhood_genus=alt.selection_single(fields=['neighbourhood_name','genus_name'], bind= {'neighbourhood_name': dropdown_neighbourhood, 'genus_name': dropdown_genus})
tree_diameter = (alt.Chart(tree_df)
.mark_tick()
.encode(
alt.X('diameter:Q', title='Diameter (inches)', scale=alt.Scale(domain=(0, 80))),
color=alt.condition(select_neighbourhood_genus, alt.value('blue'), alt.value('')))
.add_selection(select_neighbourhood_genus))
tree_height = (alt.Chart(tree_df)
.mark_tick()
.encode(
alt.X('height_range_id:Q', title='Height range ID', scale=alt.Scale(domain=(0, 9))),
color=alt.condition(select_neighbourhood_genus, alt.value('orange'), alt.value('')))
.add_selection(select_neighbourhood_genus))
street_side_barplot = (alt.Chart(tree_df)
.transform_filter(select_neighbourhood_genus)
.mark_bar()
.encode(
alt.X('street_side_name:N', title='Street side name'),
alt.Y('tree_count:Q', title='Number of trees'),
alt.Color(value='green'),
tooltip=[alt.Tooltip("tree_count:Q", title="Number of trees")])
.transform_aggregate(
tree_count='count()',
groupby=['street_side_name'])
.add_selection(select_neighbourhood_genus))
year_barplot = (alt.Chart(tree_date_df)
.transform_filter(select_neighbourhood_genus)
.mark_bar()
.encode(
alt.X('Year:N', title='Tree planted year'),
alt.Y('tree_count:Q', title='Number of trees'),
alt.Color(value='navy'),
tooltip=[alt.Tooltip("tree_count:Q", title="Number of trees")])
.transform_aggregate(
tree_count='count()',
groupby=['Year'])
.add_selection(select_neighbourhood_genus))
tree_detail_title = 'Tree details of different neighbourhood & genus'
tree_detail_plot = ((tree_diameter.properties(height=50) & tree_height & street_side_barplot.properties(height=100, width=200))| year_barplot.properties(width=500)).properties(title=alt.TitleParams(tree_detail_title, anchor='middle'))
tree_detail_plot
By selecting a neighbourhood and a genus, we can see, for that genus in that neighbourhood, its diameter and height range id, which street side its trees are on, and how many were planted in each year.
I would like to find the number of trees in each neighbourhood and of each genus by making the following bar plots.
categorical_columns=['neighbourhood_name','genus_name']
repeat_plot = (alt.Chart(tree_df)
.mark_bar()
.encode(
alt.X('count()', title='Number of trees'),
alt.Y(alt.repeat(), type='nominal', title='',sort='-x'),
alt.Color(value='navy'),
tooltip=alt.Tooltip("count()", title="Number of trees"))
.properties(width=200, height=800)
.repeat(categorical_columns))
repeat_plot.properties(title=alt.TitleParams('Number of trees of different neighbourhood and different genus', anchor='middle'))
The top 5 neighbourhoods by number of trees are Kensington-Cedar Cottage, Renfrew-Collingwood, Hastings-Sunrise, Dunbar-Southlands and Sunset. The top 5 tree genera planted are ACER, PRUNUS, TILIA, FRAXINUS and QUERCUS. How are these top 5 genera distributed across neighbourhoods? Let us analyze this in the next question.
# Filter the data to include only the top 5 tree genera
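The top genera could also be derived from the data rather than typed by hand, which keeps the filter correct if the subset changes. This is a minimal sketch on a made-up six-row frame (the genus names are real, the counts are toy values); n_top would be 5 for the analysis in this report.

```python
import pandas as pd

# Toy frame with made-up counts: ACER x3, PRUNUS x2, TILIA x1.
toy = pd.DataFrame({
    "genus_name": ["ACER", "ACER", "ACER", "PRUNUS", "PRUNUS", "TILIA"]
})

n_top = 2  # would be 5 for the analysis in this report

# value_counts sorts by frequency in descending order, so head() gives the top genera.
top_genera = toy["genus_name"].value_counts().head(n_top).index.tolist()

# Keep only the rows whose genus is in the derived top list.
filtered = toy[toy["genus_name"].isin(top_genera)]

print(top_genera)
print(len(filtered))
```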
filtered_genus_df=tree_df.query("genus_name == ['ACER', 'PRUNUS', 'TILIA', 'FRAXINUS', 'QUERCUS']")
filtered_genus_df
# Plot the bar chart and add tooltip for tree count
neighbourhood_genus_plot=(alt.Chart(filtered_genus_df)
.mark_bar()
.encode(
alt.X('count()', title='Number of trees'),
alt.Y('genus_name:N', title=''),
alt.Color('genus_name', scale=alt.Scale(scheme='set3')),
tooltip='count()').properties(width = 200).facet('neighbourhood_name', columns=5))
neighbourhood_genus_plot.properties(title=alt.TitleParams('Number of trees of Top 5 genus in different neighbourhood', anchor='middle'))
From Figure 6, among the 5 genera, ACER and PRUNUS are the top 2 planted in almost all neighbourhoods (except Downtown). The number of trees of these 5 genera differs between neighbourhoods.
Although Figure 6 gives a lot of information about the top 5 tree genera in each neighbourhood, it is not flexible. What if I want to know the top 10 genera? What if I am only interested in one selected neighbourhood, such as West End? In that case, there is no need to show data for the other neighbourhoods. Therefore, I have improved the plot by adding selection features as follows.
click_1 = alt.selection_single(fields=['neighbourhood_name'])
neighbourhood_barplot = (alt.Chart(tree_df)
.mark_bar().encode(
alt.X('neighbourhood_name:N', title='Neighbourhood name'),
alt.Y('count()',title='Number of Trees'),
alt.Color('neighbourhood_name:N', scale=alt.Scale(scheme='set3'), legend=None),
opacity=alt.condition(click_1, alt.value(1), alt.value(0.1)),
tooltip='count()')
.add_selection(click_1))
genus_barplot = (alt.Chart(tree_df)
.transform_filter(click_1)
.mark_bar()
.encode(
alt.X('tree_count:Q', title='Number of Trees', scale=alt.Scale(domain=(0, 1300))),
alt.Y('genus_name:N', sort='-x', title='Genus name'),
alt.Color(value='purple'),
tooltip=[alt.Tooltip("tree_count:Q", title="Number of trees")])
.transform_aggregate(
tree_count='count()',
groupby=['genus_name'])
.transform_window(
rank='rank(tree_count)',
sort=[alt.SortField('tree_count', order='descending')])
.transform_filter(alt.datum.rank <=10)
.add_selection(click_1))
neighbourhood_genus_title = 'Number of trees of different genus for different neighbourhood'
neighbourhood_genus_plot= (neighbourhood_barplot.properties(height=200) | genus_barplot).properties(title = alt.TitleParams(neighbourhood_genus_title, anchor='middle'))
neighbourhood_genus_plot
From Figure 7, we can find the top 10 genera (in descending order) of a selected neighbourhood by clicking on its bar in the bar chart. I have also added a tooltip that shows the number of trees of each genus. This is a more convenient and efficient way to get the required information. Figure 7 will be included in the final dashboard.
The main purpose of this analysis and visualization is to explore detailed information about Vancouver's trees. The previous analysis yields several interesting findings that answer the questions raised at the beginning of this report.
First, the number of trees planted differs considerably between years and between months (Figure 1). The number of trees planted increases steadily from 1989 to 1996. From 1996 to 2013 it fluctuates slightly, and then it drops sharply from 2013 to 2016 before rising slightly again. Most trees were planted in February, while the fewest were planted in July and August. It is recommended to plant trees in the rainy season to ensure the survival of saplings, especially in the first few months after planting (https://essc.org.ph/content/view/132/). This may be why the number of trees planted starts increasing in October, peaks in February, and slows down from May, which follows Vancouver's average precipitation trend (https://weather-and-climate.com/average-monthly-precipitation-Rainfall,vancouver,Canada).
From common sense, trees with larger diameters are usually taller (a positive correlation). This is supported by the scatterplot and boxplot (Figure 3). However, we do observe a slight decrease in diameter from height range id 8 to 9. This could be due to the small number of data points in those ranges. Another reason could be the way tree height is recorded (as a range id, not an actual height). To confirm the result, collecting more data and recording actual tree heights would help.
From Figures 5 and 6, different Vancouver neighbourhoods have different numbers of trees and different genera. The top 5 neighbourhoods by number of trees are Kensington-Cedar Cottage, Renfrew-Collingwood, Hastings-Sunrise, Dunbar-Southlands and Sunset. The top 5 tree genera planted are ACER, PRUNUS, TILIA, FRAXINUS and QUERCUS. By clicking on the interactive plot (Figure 7), we can look in more detail at the top ten genera of each neighbourhood.
For the final interactive dashboard, the goal is to build a tool for the government or for residents who want to understand the distribution and details of trees in their community or neighbourhood. With this dashboard, detailed information about trees, such as genus, diameter and height, planting year and street side name, is presented in a more convenient and efficient way.
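The aggregate, rank and filter steps of the genus bar chart can be mirrored in pandas to see what "top 10" means when counts are tied. The sketch below assumes that the chart's rank window operation gives tied counts the same rank, which is my understanding of its behaviour and is emulated here with method="min"; the function name top_genera and the toy data are my own.

```python
import pandas as pd

def top_genera(df: pd.DataFrame, neighbourhood: str, n: int = 10) -> pd.Series:
    """Tree counts of the genera ranked n or better within one neighbourhood."""
    # Filter to the selected neighbourhood, then count trees per genus.
    in_area = df.loc[df["neighbourhood_name"] == neighbourhood, "genus_name"]
    counts = in_area.value_counts()
    # Rank 1 = most trees; tied counts share the same (lowest) rank.
    rank = counts.rank(method="min", ascending=False)
    return counts[rank <= n]

# Made-up toy data: West End has ACER x3, PRUNUS x2, TILIA x2, PINUS x1.
toy = pd.DataFrame({
    "neighbourhood_name": ["West End"] * 8 + ["Sunset"] * 2,
    "genus_name": ["ACER"] * 3 + ["PRUNUS"] * 2 + ["TILIA"] * 2 + ["PINUS"]
                  + ["ACER", "QUERCUS"],
})

result = top_genera(toy, "West End", n=2)
print(result)
```

With n=2 the toy result contains three genera, because PRUNUS and TILIA are tied for second place. Under the same assumption, the chart could occasionally show more than 10 bars for a neighbourhood.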
This data visualization helped me answer all my questions, and the results met my expectations. In future work, I would like to dig deeper into the tree species and common names in different neighbourhoods. In addition, the current dataset includes only tree counts, not the population of each neighbourhood. Neighbourhoods with larger populations may plant more trees. If the dataset included neighbourhood population, it would be possible to compute the number of trees per person, which would give a better idea of how well each community has done in tree planting.
Now I am ready to make the final dashboard. For the dashboard, I combine Figure 4 and Figure 7 so that it shows the tree details for different Vancouver neighbourhoods. The dashboard is coded as follows and shown in Figure 9.
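If a population table were available, the per-person comparison could be computed as sketched below. Everything here is hypothetical: the neighbourhood labels, tree counts and population figures are placeholders chosen for easy arithmetic, not real Vancouver data.

```python
import pandas as pd

# Placeholder values only: neither the tree counts nor the populations are real.
tree_counts = pd.Series({"Neighbourhood A": 300, "Neighbourhood B": 150}, name="n_trees")
population = pd.Series({"Neighbourhood A": 30000, "Neighbourhood B": 10000}, name="population")

# Align the two series on the neighbourhood label and normalize by population.
per_capita = pd.concat([tree_counts, population], axis=1)
per_capita["trees_per_1000_people"] = (
    per_capita["n_trees"] * 1000 / per_capita["population"]
)

print(per_capita)
```

In this toy case the neighbourhood with fewer trees has more trees per 1000 people, which is the kind of reversal that raw counts alone would hide.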
dashboard_title =alt.TitleParams(
'Vancouver trees dataset analysis and visualization',
subtitle = ['What are the tree details for different neighbourhoods?'],
fontSize=30, subtitleFontSize = 20, align ='center', anchor='middle')
dashboard_plot = (neighbourhood_genus_plot & tree_detail_plot).properties(title=dashboard_title)
dashboard_plot